Phea is a framework for computing electronic [patient] phenotypes:
The intended use case is where you have a database with patient records and want to compute a formula. However, in principle, Phea’s approach need not be restricted to this use case.
Phea passes formulas unmodified to the SQL query, so they need not be
restricted to mathematical operators. The formulas can include any SQL
expression that would be valid in a SELECT statement, such
as CASE WHEN ... expressions and function calls.
To use Phea, you need a DBI connection to the SQL server, generated
by functions such as DBI::dbConnect() or
DatabaseConnector::connect().
Phea also includes a function to create an interactive plot of the phenotype results in one patient. Below is an example of the body mass index formula.
Phea goes over patient records in chronological order. Say you want to compute a formula:
body_mass_index = body_weight / (body_height * body_height)
Phea will compute body_mass_index for each person using
the most recently available data at each point in time.
In this example, the formula depends on body_weight and
body_height. Therefore, their timestamps will be the
timestamps for the resulting body_mass_index. This means
that at every point in time where there is a body_weight
or body_height a record for a given patient, Phea
will try to calculate body_mass_index again.
Suppose a patient had a body_weight = 75 kg measurement
on May 1st, a body_height = 1.8 meters on August 1st, then
a new body_weight = 79 kg on October 1st. Phea would
produce 3 result rows as follows:
May 1st:
body_weight = 75 kg
body_height = NULL
body_mass_index = 75 / (NULL * NULL) = NULL
August 1st:
body_weight = 75 kg
body_height = 1.8 meters
body_mass_index = 75 / (1.8 * 1.8) = 23.14815
October 1st:
body_weight = 79 kg
body_height = 1.8 meters
body_mass_index = 79 / (1.8 * 1.8) = 24.38272
The algorithm continues as long as new records are avaialable. For
example, if on November 1st a new record
body_height = 1.78 meters appears, the result would be:
November 1st:
body_weight = 79 kg
body_height = 1.78 meters
body_mass_index = 75 / (1.78 * 1.78) = 24.93372
At each point in time, the most recently available data is used, unless you specify otherwise when creating a component.
First, you need to provide Phea with your SQL connection. While Phea is not necessarily restricted to PostgreSQL, this is the only SQL flavor where it has been tested.
library(phea)
# Connect to SQL server
dbcon <- DBI::dbConnect(
RPostgres::Postgres(),
host = 'localhost',
port = 7654,
dbname = 'fort',
user = cred$pg$user, # Credentials not declared in this file
password = cred$pg$pass) # Credentials not declared in this file
# Setup Phea and define cdm_new_york3 as default schema.
setup_phea(
connection = dbcon,
schema = 'cdm_new_york3')
setup_phea() will save dbi_connection and
def_schema for use by functions sql0() and
sqlt(). Most importantly, however, it will install two
user-defined functions on the SQL server
(phea_coalesce_r_sfunc and
phea_find_last_ignore_nulls), if they are not already
there.
Each formula computation comes from combining three parts:
record sourcescomponentsformulasA record source is any SQL query that provides data you
want to use. Its records are assigned a label rec_name. The
record source is assumed to be a long table, that
is, each row corresponds to one record. At a minimum, each row in a
record source must have a patient identifier
(pid) column and a timestamp (ts) column.
For exemple, suppose you have an OMOP Common Data Model schema called
cdm_new_york3, and you would like to compute a formula
involving a patient’s hemoglobin A1c level.
- Such hemoglobin records are in table measurement inside
schema cdm_new_york3.
- They are identified by the OMOP CDM concept_id of 3004410
(corresponds to LOINC code 4548-4,
Hemoglobin A1c/Hemoglobin.total in Blood).
- The patient identifiers are in column person_id
- The timestamps are in column measurement_datetime.
- In the formula, we want to refer to this value as “hemo_a1c”.
Let’s create the record source:
hemoglobin_query <- sql0('
select * from cdm_new_york3.measurement
where measurement_concept_id = 3004410
')
hemoglobin_record_source <- make_record_source(
records = hemoglobin_query,
ts = measurement_datetime,
pid = person_id)
A component is a specification of what records to pick
from record source. If you just want the most recently
available record, the default values need not be overriden:
hemoglobin_component <- make_component(
input_source = hemoglobin_record_source)
The hemoglobin_component can now be used to call
calculate_formula(). For the sake of the example, let us
suppose the phenotype phen that you want to compute is
merely the square root of the patient’s hemoglobin A1c level.
phen = sqrt(hemoglobin A1c level)
The numeric value hemoglobin A1c is located in column
value_as_number within the record source. The call to
calculate_formula() is then:
phen <- calculate_formula(
components = list(
hemo_a1c = hemoglobin_component),
fml = 'sqrt(hemo_a1c_value_as_number)')
Phea automatically recognizes that
hemo_a1c_value_as_number means column
value_as_number in the SQL table associated with component
hemo_a1c.
Object phen is a lazy table in the
dplyr/dbplyr framework. It can be materialized
into a tibble by calling collect(), or manipulated further
using Phea or dplyr/dbplyr.
dplyr::collect(head(phen)) |> knitr::kable()
| row_id | pid | ts | window | hemo_a1c_value_as_number | value |
|---|---|---|---|---|---|
| 1 | 1 | 2011-05-20 | 00:00:00 | 6.9 | 2.626785 |
| 2 | 1 | 2014-03-07 | 00:00:00 | 6.9 | 2.626785 |
| 3 | 1 | 2016-03-11 | 00:00:00 | 6.9 | 2.626785 |
| 4 | 1 | 2018-03-16 | 00:00:00 | 6.9 | 2.626785 |
| 5 | 1 | 2020-03-20 | 00:00:00 | 6.9 | 2.626785 |
| 6 | 1 | 2022-02-11 | 00:00:00 | 6.9 | 2.626785 |
To obtain just the SQL code, sql_render() can be used,
or the option .clip_sql in
calculate_formula():
dbplyr::sql_render(phen)
#> <SQL> SELECT
#> "row_id",
#> "pid",
#> "ts",
#> "window",
#> "hemo_a1c_value_as_number",
#> sqrt(hemo_a1c_value_as_number) AS "value"
#> FROM (
#> SELECT
#> *,
#> "ts" - ts AS "window",
#> last_value(row_id) over (partition by "pid", "ts") AS "ts_row"
#> FROM (
#> SELECT
#> "row_id",
#> "pid",
#> "ts",
#> MAX("hemo_a1c_value_as_number") OVER (PARTITION BY "..dbplyr_partion_1") AS "hemo_a1c_value_as_number",
#> MAX("hemo_a1c_ts") OVER (PARTITION BY "..dbplyr_partion_2") AS "hemo_a1c_ts"
#> FROM (
#> SELECT
#> *,
#> SUM(CASE WHEN (("hemo_a1c_value_as_number" IS NULL)) THEN 0 ELSE 1 END) OVER (ORDER BY "pid", "ts" ROWS UNBOUNDED PRECEDING) AS "..dbplyr_partion_1",
#> SUM(CASE WHEN (("hemo_a1c_ts" IS NULL)) THEN 0 ELSE 1 END) OVER (ORDER BY "pid", "ts" ROWS UNBOUNDED PRECEDING) AS "..dbplyr_partion_2"
#> FROM (
#> SELECT
#> row_number() over () AS "row_id",
#> "pid",
#> "ts",
#> last_value(case when "name" = 'gwixomba5v8j' then "value_as_number" else null end) over (partition by "pid", "name" order by "ts" rows between unbounded preceding and 0 preceding) AS "hemo_a1c_value_as_number",
#> last_value(case when "name" = 'gwixomba5v8j' then "ts" else null end) over (partition by "pid", "name" order by "ts" rows between unbounded preceding and 0 preceding) AS "hemo_a1c_ts"
#> FROM (
#> SELECT
#> 'gwixomba5v8j' AS "name",
#> "person_id" AS "pid",
#> "measurement_datetime" AS "ts",
#> "value_as_number"
#> FROM (
#>
#> select * from cdm_new_york3.measurement
#> where measurement_concept_id = 3004410
#>
#> ) "q01"
#> ) "q02"
#> ) "q03"
#> ) "q04"
#> ) "q05"
#> ) "q06"
#> WHERE ("row_id" = "ts_row")
Notice that the result of the
sqrt(hemo_a1c_value_as_number) is in column
value. This can be changed by giving a name to the formula
in the call to calculate_formula():
phen <- calculate_formula(
components = list(
hemo_a1c = hemoglobin_component),
fml = list(
a1c_sqrt = 'sqrt(hemo_a1c_value_as_number)'))
dplyr::collect(head(phen)) |> knitr::kable()
| row_id | pid | ts | window | hemo_a1c_value_as_number | a1c_sqrt |
|---|---|---|---|---|---|
| 1 | 1 | 2011-05-20 | 00:00:00 | 6.9 | 2.626785 |
| 2 | 1 | 2014-03-07 | 00:00:00 | 6.9 | 2.626785 |
| 3 | 1 | 2016-03-11 | 00:00:00 | 6.9 | 2.626785 |
| 4 | 1 | 2018-03-16 | 00:00:00 | 6.9 | 2.626785 |
| 5 | 1 | 2020-03-20 | 00:00:00 | 6.9 | 2.626785 |
| 6 | 1 | 2022-02-11 | 00:00:00 | 6.9 | 2.626785 |
Multiple formulas can be computed at once:
phen <- calculate_formula(
components = list(
hemo_a1c = hemoglobin_component),
fml = list(
a1c_sqrt = 'sqrt(hemo_a1c_value_as_number)',
a1c_cubed = 'hemo_a1c_value_as_number * hemo_a1c_value_as_number * hemo_a1c_value_as_number'))
dplyr::collect(head(phen)) |> knitr::kable()
| row_id | pid | ts | window | hemo_a1c_value_as_number | a1c_sqrt | a1c_cubed |
|---|---|---|---|---|---|---|
| 1 | 1 | 2011-05-20 | 00:00:00 | 6.9 | 2.626785 | 328.509 |
| 2 | 1 | 2014-03-07 | 00:00:00 | 6.9 | 2.626785 | 328.509 |
| 3 | 1 | 2016-03-11 | 00:00:00 | 6.9 | 2.626785 | 328.509 |
| 4 | 1 | 2018-03-16 | 00:00:00 | 6.9 | 2.626785 | 328.509 |
| 5 | 1 | 2020-03-20 | 00:00:00 | 6.9 | 2.626785 | 328.509 |
| 6 | 1 | 2022-02-11 | 00:00:00 | 6.9 | 2.626785 | 328.509 |
If you need to use the result of one formula in the next one, set
.cascaded = TRUE:
phen <- calculate_formula(
components = list(
hemo_a1c = hemoglobin_component),
fml = list(
a1c_sqrt = 'sqrt(hemo_a1c_value_as_number)',
a1c_half_sqrt = 'a1c_sqrt / 2'),
.cascaded = TRUE)
dplyr::collect(head(phen)) |> knitr::kable()
| row_id | pid | ts | window | hemo_a1c_value_as_number | a1c_sqrt | a1c_half_sqrt |
|---|---|---|---|---|---|---|
| 1 | 1 | 2011-05-20 | 00:00:00 | 6.9 | 2.626785 | 1.313393 |
| 2 | 1 | 2014-03-07 | 00:00:00 | 6.9 | 2.626785 | 1.313393 |
| 3 | 1 | 2016-03-11 | 00:00:00 | 6.9 | 2.626785 | 1.313393 |
| 4 | 1 | 2018-03-16 | 00:00:00 | 6.9 | 2.626785 | 1.313393 |
| 5 | 1 | 2020-03-20 | 00:00:00 | 6.9 | 2.626785 | 1.313393 |
| 6 | 1 | 2022-02-11 | 00:00:00 | 6.9 | 2.626785 | 1.313393 |
So far the examples showed formulas with only one component. Suppose
you would like to compute:
phen = sqrt([hemoglobin a1c]) * [patient's age at the time of measurement]
To obtain the patient’s age, we can collect the date of birth from
table person, then create a component from it.
person_records <- make_record_source(
records = sqlt(person),
ts = birth_datetime,
pid = person_id)
person_component <- make_component(person_records)
phen <- calculate_formula(
components = list(
hemo_a1c = hemoglobin_component,
person = person_component),
fml = list(
pat_age = 'age(hemo_a1c_measurement_datetime, person_birth_datetime)',
a1c_times_age = 'hemo_a1c_value_as_number * extract(year from pat_age)'),
.cascaded = TRUE)
dplyr::collect(head(phen)) |> knitr::kable()
| row_id | pid | ts | window | hemo_a1c_measurement_datetime | person_birth_datetime | hemo_a1c_value_as_number | pat_age | a1c_times_age |
|---|---|---|---|---|---|---|---|---|
| 8 | 1 | 1974-03-01 | 00:00:00 | NA | 1974-03-01 | NA | NA | NA |
| 1 | 1 | 2011-05-20 | 13594 days | 2011-05-20 | 1974-03-01 | 6.9 | 37 years 2 mons 19 days | 255.3 |
| 2 | 1 | 2014-03-07 | 14616 days | 2014-03-07 | 1974-03-01 | 6.9 | 40 years 6 days | 276.0 |
| 3 | 1 | 2016-03-11 | 15351 days | 2016-03-11 | 1974-03-01 | 6.9 | 42 years 10 days | 289.8 |
| 4 | 1 | 2018-03-16 | 16086 days | 2018-03-16 | 1974-03-01 | 6.9 | 44 years 15 days | 303.6 |
| 5 | 1 | 2020-03-20 | 16821 days | 2020-03-20 | 1974-03-01 | 6.9 | 46 years 19 days | 317.4 |
Phea includes in the result all data points used in the formula
computation. If other columns from the same rows are wanted,
export = can be used:
phen <- calculate_formula(
components = list(
hemo_a1c = hemoglobin_component,
person = person_component),
fml = list(
pat_age = 'age(hemo_a1c_measurement_datetime, person_birth_datetime)',
a1c_times_age = 'hemo_a1c_value_as_number * extract(year from pat_age)'),
.cascaded = TRUE,
export = c(
'hemo_a1c_provider_id',
'hemo_a1c_visit_occurrence_id',
'person_gender_concept_id'))
dplyr::collect(head(phen)) |> knitr::kable()
| row_id | pid | ts | window | hemo_a1c_measurement_datetime | person_birth_datetime | hemo_a1c_value_as_number | hemo_a1c_provider_id | hemo_a1c_visit_occurrence_id | person_gender_concept_id | pat_age | a1c_times_age |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 1 | 1974-03-01 | 00:00:00 | NA | 1974-03-01 | NA | NA | NA | 8507 | NA | NA |
| 1 | 1 | 2011-05-20 | 13594 days | 2011-05-20 | 1974-03-01 | 6.9 | 22084 | 21 | 8507 | 37 years 2 mons 19 days | 255.3 |
| 2 | 1 | 2014-03-07 | 14616 days | 2014-03-07 | 1974-03-01 | 6.9 | 27 | 24 | 8507 | 40 years 6 days | 276.0 |
| 3 | 1 | 2016-03-11 | 15351 days | 2016-03-11 | 1974-03-01 | 6.9 | 22084 | 19 | 8507 | 42 years 10 days | 289.8 |
| 4 | 1 | 2018-03-16 | 16086 days | 2018-03-16 | 1974-03-01 | 6.9 | 22084 | 18 | 8507 | 44 years 15 days | 303.6 |
| 5 | 1 | 2020-03-20 | 16821 days | 2020-03-20 | 1974-03-01 | 6.9 | 22084 | 17 | 8507 | 46 years 19 days | 317.4 |
When creating a component, arguments line and
delay can be used to pick records other than the one most
recently available.
line causes records to be skipped in reverse
chronological order. line = 0 (default) gives the most
recent record available. line = 1 skips 1 record, that is,
gives the second most recent. line = 2 skips 2 records, and
so on.
hemoglobin_component <- make_component(
input_source = hemoglobin_record_source)
previous_hemoglobin_component <- make_component(
input_source = hemoglobin_record_source,
line = 1)
phen <- calculate_formula(
components = list(
hemo_a1c = hemoglobin_component,
prev_hemo_a1c = previous_hemoglobin_component),
fml = list(
a1c_difference = 'hemo_a1c_value_as_number - prev_hemo_a1c_value_as_number'),
export = c(
'hemo_a1c_visit_occurrence_id',
'prev_hemo_a1c_visit_occurrence_id'))
dplyr::collect(head(phen)) |> knitr::kable()
| row_id | pid | ts | window | hemo_a1c_value_as_number | prev_hemo_a1c_value_as_number | hemo_a1c_visit_occurrence_id | prev_hemo_a1c_visit_occurrence_id | a1c_difference |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2011-05-20 | 00:00:00 | 6.9 | NA | 21 | NA | NA |
| 2 | 1 | 2014-03-07 | 1022 days | 6.9 | 6.9 | 24 | 21 | 0 |
| 3 | 1 | 2016-03-11 | 735 days | 6.9 | 6.9 | 19 | 24 | 0 |
| 4 | 1 | 2018-03-16 | 735 days | 6.9 | 6.9 | 18 | 19 | 0 |
| 5 | 1 | 2020-03-20 | 735 days | 6.9 | 6.9 | 17 | 18 | 0 |
| 6 | 1 | 2022-02-11 | 693 days | 6.9 | 6.9 | 29 | 17 | 0 |
delay imposes a minimum time difference between the
timestamp of the component and the timestamp of the phenotype. They are
defined in SQL language. For example, delay = '1 day' gives
the most recently available record that is at least 1 day older than the
phenotype. If there are no such records, the value is empty.
hemoglobin_component <- make_component(
input_source = hemoglobin_record_source)
old_hemoglobin_component <- make_component(
input_source = hemoglobin_record_source,
delay = '6 years')
phen <- calculate_formula(
components = list(
hemo_a1c = hemoglobin_component,
old_hemo_a1c = old_hemoglobin_component),
fml = list(
six_year_diff = 'hemo_a1c_value_as_number - old_hemo_a1c_value_as_number'),
export = c(
'hemo_a1c_visit_occurrence_id',
'old_hemo_a1c_visit_occurrence_id'))
dplyr::collect(head(phen)) |> knitr::kable()
| row_id | pid | ts | window | hemo_a1c_value_as_number | old_hemo_a1c_value_as_number | hemo_a1c_visit_occurrence_id | old_hemo_a1c_visit_occurrence_id | six_year_diff |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2011-05-20 | 00:00:00 | 6.9 | NA | 21 | NA | NA |
| 2 | 1 | 2014-03-07 | 00:00:00 | 6.9 | NA | 24 | NA | NA |
| 3 | 1 | 2016-03-11 | 00:00:00 | 6.9 | NA | 19 | NA | NA |
| 4 | 1 | 2018-03-16 | 2492 days | 6.9 | 6.9 | 18 | 21 | 0 |
| 5 | 1 | 2020-03-20 | 2205 days | 6.9 | 6.9 | 17 | 24 | 0 |
| 6 | 1 | 2022-02-11 | 2898 days | 6.9 | 6.9 | 29 | 24 | 0 |
As of now, line and delay
cannot be used simultaneously in the same component.
This is a planned feature. If both are provided, line is
considered and delay is ignored. delay is only
considered when line = NA.